
 1-bit quantization


0e230b1a582d76526b7ad7fc62ae937d-AuthorFeedback.pdf

Neural Information Processing Systems

More extensive and thorough experiments are needed. Sub-1-bit quantization is only available through FleXOR. Or do some weights use >1 bit while others can use much less? The reviewer did not find results in the paper that used quantized inputs. "Input weight format" should read "Internal weight format."


A Additional Related Work. KD has been extensively applied to computer vision and NLP tasks [52]

Neural Information Processing Systems

Examples of such knowledge definitions include output logits (e.g., DistilBERT). We have defined the three types of KD, 1S-KD, 2S-KD, and 3S-KD, in the main text; here we explain them in more detail. One-Stage KD naively minimizes the sum of teacher-student differences on hidden states, attentions, and logits. Finally, Three-Stage KD inherits the properties of 1S-KD and 2S-KD. See Table C.1 for the size of each (augmented) dataset. Overall we find the standard deviations are within 0.1 on the GLUE score.
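As a rough illustration of the One-Stage objective described above, the combined teacher-student loss can be sketched as follows. This is a minimal sketch, assuming mean-squared error on all three signals; the function and dictionary-key names are illustrative, not the paper's actual formulation:

```python
def mse(a, b):
    # mean-squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def one_stage_kd_loss(teacher, student):
    # 1S-KD (sketch): naively sum the teacher-student gaps on
    # hidden states, attentions, and output logits in one objective
    return (mse(teacher["hidden"], student["hidden"])
            + mse(teacher["attn"], student["attn"])
            + mse(teacher["logits"], student["logits"]))
```

A student matching the teacher on all three signals incurs zero loss; any mismatch on any signal increases it additively.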




CommVQ: Commutative Vector Quantization for KV Cache Compression

Li, Junyan, Zhang, Yang, Hassan, Muhammad Yusuf, Chafekar, Talha, Cai, Tianle, Ren, Zhile, Guo, Pengsheng, Karimzadeh, Foroozan, Reed, Colorado, Wang, Chong, Gan, Chuang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
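The core additive-quantization idea in the abstract — a vector is encoded as one code per codebook and decoded as a sum of the selected code words, which is a matrix multiplication against one-hot codes — can be sketched in a few lines. This is a toy greedy encoder, not CommVQ's actual method: the lightweight encoder, the EM-trained RoPE-commutative codebook, and all names here are omitted or assumed:

```python
def encode_additive(x, codebooks):
    # Greedy additive quantization: for each codebook in turn, pick the
    # code word closest to the current residual, then subtract it.
    residual = list(x)
    codes = []
    for cb in codebooks:
        best_i, best_err = 0, float("inf")
        for i, word in enumerate(cb):
            err = sum((r - w) ** 2 for r, w in zip(residual, word))
            if err < best_err:
                best_i, best_err = i, err
        codes.append(best_i)
        residual = [r - w for r, w in zip(residual, cb[best_i])]
    return codes

def decode_additive(codes, codebooks):
    # Decoding is just a sum of the selected code words — equivalently,
    # a matrix multiply between one-hot codes and the codebook matrix.
    out = [0.0] * len(codebooks[0][0])
    for code, cb in zip(codes, codebooks):
        out = [o + w for o, w in zip(out, cb[code])]
    return out
```

Storage drops from one float per dimension to one small code index per codebook, which is where the 2-bit and 1-bit KV-cache sizes come from.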


Frame Quantization of Neural Networks

Czaja, Wojciech, Na, Sanghoon

arXiv.org Machine Learning

Quantization is the process of compressing input from a continuous or large set of values into a small-sized discrete set. It gained popularity in signal processing, where one of its primary goals is obtaining a condensed representation of the analogue signal suitable for digital storage and recovery. Examples of quantization algorithms include truncated binary expansion, pulse-code modulation (PCM) and sigma-delta (ΣΔ) quantization. Among them, ΣΔ algorithms stand out due to their theoretically guaranteed robustness. Mathematical theories were developed in several seminal works [3-5, 8, 11], and have been carefully studied since, e.g., [14, 15, 19, 27]. In recent years, the concept of quantization also captured the attention of the machine learning community. The quantization of deep neural networks (DNNs) is considered one of the most effective network compression techniques [9]. Computers express parameters of a neural network as 32-bit or 64-bit floating point numbers.
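To make the ΣΔ idea concrete: a first-order sigma-delta quantizer emits one coarse (here 1-bit) value per sample while carrying the accumulated quantization error forward, so the error is shaped rather than lost. This is a textbook sketch of the classic first-order scheme, not the paper's construction:

```python
def sigma_delta_1bit(samples):
    # First-order sigma-delta: quantize each sample to +/-1 while the
    # state u tracks the running quantization error (noise shaping).
    u = 0.0
    bits = []
    for x in samples:
        q = 1.0 if u + x >= 0 else -1.0  # 1-bit quantizer decision
        u = u + x - q                    # carry the error to later samples
        bits.append(q)
    return bits
```

For a constant input of 0.5, the bitstream averages to 0.5 even though every individual output is ±1 — the error cancels over time, which is the robustness property the abstract alludes to.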